1. Introduction


Object detection, or more generally pattern detection and recognition, can be based on many
different principles. Objects can be described through their structure, shape, color,
texture, etc. [Blaschko & Lampert (2009); Chen et al. (2004); Fidler & Leonardis (2007); Leibe
et al. (2008); Lowe (1999); Serre et al. (2005); Viola & Jones (2001)]; therefore, a variety of
object detection mechanisms have been developed over time. One of the modern approaches to
object detection is similarity-based detection, where the objects of interest are defined through
a set of examples and typically also through a set of counter-examples, and the decision
whether an object is an object of interest is made by a machine-learning-based functional
block, the classifier. Object detection in an image is then performed by applying the
classifier to sub-windows of the image.


The focus of this chapter is on statistical binary classifiers, whose function is to make a binary
decision on whether an image region is or is not an object of interest. The methods of interest
include mainly AdaBoost [Freund (1995); Schapire et al. (1998)], whose original purpose was
to fuse a small number of relatively well-performing, so-called weak hypotheses into one
better-performing strong classifier. This approach was later extended so that, instead of a small
number of weak classifiers, it takes a large number of simple functions into account and selects
suitable weak classifiers from them automatically. This method was demonstrated in the
pioneering work of Viola and Jones [Viola & Jones (2001)].


The AdaBoost approach has been further refined and modified [Bourdev & Brandt (2005);
Li et al. (2002); Šochman & Matas (2004; 2005)]. Perhaps the most important modification
is WaldBoost by Šochman & Matas (2005), which combines Wald's sequential
decision making [Wald (1947)] with AdaBoost. The main advantage of WaldBoost
is a significant gain in computational performance compared with AdaBoost classifiers, with
virtually no change in classification quality.


The detection through classification involves the application of the classifier on a selection
of sub-images of the analyzed image. As the classification results of neighboring
sub-images may be statistically significantly interdependent, it is worth studying whether the
inter-dependencies can be exploited to reduce the computational effort through the prediction
of classifier results in certain sub-images, through suppression of unwanted object detection
(e.g. multiple detections in very close image locations), or simply through the sharing of
intermediate results of the calculations. These aspects of object detection are addressed in this
chapter as well.


www.intechopen.com


228 Real-Time Systems, Architecture, Scheduling, and Application


Fig. 1. Scanning the image with a classifier. Individual sub-images of the image are classified
by a classifier (Image source: BioID dataset).


The structure of the chapter is as follows. The next section gives a brief introduction to
object detection with classifiers. Section 3 discusses the properties of features extracted from
an image and describes feature types often used for rapid object detection. Section 4 describes
the ideas behind the AdaBoost and WaldBoost learning procedures. Acceleration methods for
WaldBoost-based detection are introduced in Section 5. The implementation of the detection
runtime on different platforms is discussed in Section 6. Some results of the detection
acceleration are presented in Section 7, and finally we conclude in Section 8 with some ideas
for future research.


2. Object detection with classifiers 


Classifiers are suitable for deciding whether a given sub-image shows an object of
interest or not. Such functionality is obviously of interest for object detection, but it is not
sufficient on its own. The reason is that, for reliable classification, the variability of the objects
of interest has to be minimized: the classifiers are trained to detect well-aligned, centered and
size-normalized objects in the classified sub-image. Therefore, the actual detection of objects
is performed by classifying the contents of all the sub-windows that can contain the
object of interest, or simply by classifying all possible sub-windows. This is
usually done by scanning the image with a moving window of a fixed size: the
content of the window is classified at each location and, if the object of interest is found, the
location is reported as an output of the detection process.
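The scanning procedure described above can be illustrated with a minimal Python sketch. It assumes a grayscale image stored as a list of rows and a classifier supplied by the caller; the names `scan_image` and `bright_window` are hypothetical, and the stand-in classifier is a plain brightness test rather than a trained detector.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # grayscale image as rows of pixel values


def scan_image(image: Image,
               classify: Callable[[List[List[int]]], bool],
               win_w: int, win_h: int,
               step: int = 1) -> List[Tuple[int, int]]:
    """Classify every window position; return top-left corners of accepted windows."""
    height, width = len(image), len(image[0])
    detections = []
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            # Cut the fixed-size window out of the image.
            window = [list(row[x:x + win_w]) for row in image[y:y + win_h]]
            if classify(window):
                detections.append((x, y))
    return detections


def bright_window(window: List[List[int]]) -> bool:
    """Hypothetical stand-in classifier: accepts windows with mean intensity above 128."""
    pixels = [p for row in window for p in row]
    return sum(pixels) / len(pixels) > 128


# A 6 x 6 black image with one bright 2 x 2 square at x = 3..4, y = 2..3.
image = [[0] * 6 for _ in range(6)]
for y in range(2, 4):
    for x in range(3, 5):
        image[y][x] = 255

hits = scan_image(image, bright_window, win_w=2, win_h=2)
```

In this toy example only the window that exactly covers the bright square is accepted, which mirrors the requirement that objects be well aligned and centered in the classified sub-image.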


The approach described above is, in fact, an exhaustive search for an object of interest
in the image, where all sub-images are classified in order to determine whether they
contain an object of interest. While the classification process is in general quite simple
(as shown in more detail below), it is sometimes feasible to pre-process the analyzed
image in order to identify the parts where the object(s) of interest cannot be present;
such parts of the image can be excluded from classification, which reduces the computational
effort. Good examples are color-based pre-processing, where,
e.g., a flower cannot be present in a part of the image that contains "completely blue sky" and
a human face cannot be found in a part of an image that does not contain "skin color", and
geometry-based approaches, where an airplane is not expected to be detected
below walking people in the image.


As is obvious from the above description, detection of objects through AdaBoost/WaldBoost
methods depends on object orientation and size; in many applications, however, it is
desirable to detect objects regardless of their size or orientation. While this requirement is
difficult or often impossible to handle directly in the AdaBoost/WaldBoost machine learning
process, a feasible approach is to handle it indirectly by repeating the detection process
for different scales and/or orientations. The main reason is that, in general, the feature
extraction methods (weak classifiers) are not rotation, scale or shift invariant. Therefore,
the detection process should be applied repeatedly to sample the rotation, scale, etc. over the
needed range. The required sampling density depends on the tolerance of the classifier
to rotation, scale, etc. This tolerance is in general not predictable and depends on the dataset.
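As an illustration of sampling the scale, the following sketch lists the window sizes at which a fixed-size detector could be re-applied. The geometric progression and the ratio of 1.2 are assumptions chosen for the example; in practice the ratio would follow from the measured scale tolerance of the particular classifier. The function name `window_sizes` is hypothetical.

```python
from typing import List


def window_sizes(base: int, image_side: int, ratio: float = 1.2) -> List[int]:
    """Side lengths base * ratio**k (rounded) that still fit into the image."""
    sizes = []
    scale = 1.0
    while base * scale <= image_side:
        sizes.append(round(base * scale))
        scale *= ratio  # assumed scale tolerance of the classifier
    return sizes


# A 24-pixel detector applied to an image whose shorter side is 100 pixels.
sizes = window_sizes(base=24, image_side=100)
```

A smaller ratio samples the scale more densely and increases the number of detection passes accordingly.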


3. Efficient feature extraction 


The performance of object detection is to a large extent determined by the underlying feature
extraction methods. Features extracted from an image have two main properties: a)
descriptive power and b) computational complexity. The goal in rapid object detection is to
use features that are computationally simple and, at the same time, descriptive. In the vast
majority of cases, these two properties conflict, and thus there are computationally simple
features with low descriptive power (e.g. isolated pixels, sums of area intensity) and complex,
hard-to-compute features with high descriptive power (Gabor wavelets [Lee (1996)], HoG
[Dalal & Triggs (2005)], SIFT and SURF [Bay et al. (2008); Lowe (2004)], etc.). A close to
ideal approach is that of Viola and Jones [Viola & Jones (2001)], whose Haar features are
calculated in constant time from an integral representation of the image. The features used in
this chapter are Local Binary Patterns (LBP) [Zhang et al. (2007)], Local Rank Patterns (LRP)
[Hradiš et al. (2008)] and Local Rank Differences (LRD) [Zemčík et al. (2007)]. Their main
properties are as follows.


* Strict locality: evaluation is based strictly on local data (i.e. no normalization is needed).


* Simple evaluation: the input consists of coefficients extracted from an image by convolution
with a rectangular kernel. The coefficients are processed by a simple formula.


Fig. 2. Feature samples for LBP (left), LRD and LRP (right) 


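All presented features are based on the same model; the only difference is their evaluation
function. First, coefficients $v_i$ from a regular 3 x 3 grid (see Fig. 2) are extracted by convolution.
The coefficients are then processed by an evaluation function that produces the response:

$$\mathrm{LBP}(\mathbf{v}, c) = \sum_{i=0}^{7} [v_i > c] \, 2^i \qquad (1)$$
$$\mathrm{LRD}(\mathbf{v}, a, b) = r(v_a, \mathbf{v}) - r(v_b, \mathbf{v}) \qquad (2)$$
$$\mathrm{LRP}(\mathbf{v}, a, b) = 10 \, r(v_a, \mathbf{v}) + r(v_b, \mathbf{v}) \qquad (3)$$


LBP is evaluated by comparing all samples to the central one, $c$. In (1) the sum runs over the
eight samples surrounding the central one, and the bracket $[v_i > c]$ equals 1 when the comparison
holds and 0 otherwise, so the result of each comparison is treated as a single bit of the 8-bit
code. The LRD and LRP features are parametrized by the indices $a$ and $b$ of two samples whose
ranks are calculated by (4); the rank $r(s, \mathbf{v})$ is the number of coefficients smaller than $s$. The
ranks are subtracted in the case of LRD (2) or combined in the case of LRP (3).


$$r(s, \mathbf{v}) = \sum_{i=1}^{9} \begin{cases} 1, & \text{if } v_i < s \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$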


The response range of the features is [0, 255] for LBP, [-8, 8] for LRD and [0, 99] for LRP. The
response is used as an input to a weak classifier, which is essentially a look-up table assigning
a weak classifier response to each feature response.
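A minimal sketch of the three evaluation functions may help to make the definitions concrete. It assumes that the nine coefficients of the 3 x 3 grid are given in row-major order with the central sample at index 4, and that the LBP bits are assigned to the neighbours in that same order; the bit ordering and the function names are assumptions made for this illustration only.

```python
from typing import Sequence


def rank(value: int, coeffs: Sequence[int]) -> int:
    """Number of grid coefficients that are strictly smaller than value."""
    return sum(1 for v in coeffs if v < value)


def lbp(coeffs: Sequence[int]) -> int:
    """8-bit code: one bit per neighbour that is greater than the central sample."""
    centre = coeffs[4]
    neighbours = [v for i, v in enumerate(coeffs) if i != 4]  # assumed bit order
    return sum(1 << i for i, v in enumerate(neighbours) if v > centre)


def lrd(coeffs: Sequence[int], a: int, b: int) -> int:
    """Difference of the ranks of samples a and b."""
    return rank(coeffs[a], coeffs) - rank(coeffs[b], coeffs)


def lrp(coeffs: Sequence[int], a: int, b: int) -> int:
    """Ranks of samples a and b combined into one two-digit code."""
    return 10 * rank(coeffs[a], coeffs) + rank(coeffs[b], coeffs)


grid = [10, 20, 30,
        40, 50, 60,
        70, 80, 90]

code = lbp(grid)           # neighbours 60, 70, 80, 90 exceed the centre 50
diff = lrd(grid, 0, 8)     # rank 0 minus rank 8
pattern = lrp(grid, 8, 0)  # 10 * 8 + 0
```

The weak classifier then only has to index a look-up table with the returned integer.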


4. AdaBoost and WaldBoost 


AdaBoost [Freund (1995)] and other boosting algorithms [Friedman et al. (2000); Grove &
Schuurmans (1998); Rätsch (2001); Rudin et al. (2004); Schapire et al. (1998)] all combine weak
hypotheses $h_t : \mathcal{X} \to \mathbb{R}$ into a strong classifier $H_T$. The combination is a weighted sum in
which the responses of the weak hypotheses are multiplied by weights $\alpha_t$ determining their
importance:

$$H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (5)$$


The weak hypotheses often internally partition the object space $\mathcal{X}$ into a set of disjoint areas
based on a single feature response. Such weak hypotheses are called space partitioning
weak hypotheses [Schapire & Singer (1999)], and the partition functions $f : \mathcal{X} \to \mathbb{N}$ are
referred to in the following text simply as features. A space partitioning weak hypothesis
is a combination of such a feature and a look-up table function $l : \mathbb{N} \to \mathbb{R}$:

$$h_t(x) = l_t(f_t(x)) \qquad (6)$$

The real value assigned by $l_t$ to output $j$ of $f_t$ is denoted as $c_t^{(j)}$ in the text.

Most boosting algorithms order the weak classifiers starting with the most informative
one, and thus it is reasonable to evaluate them in this order and stop when the classification
decision is certain enough. Such classifiers are called soft cascades [Bourdev & Brandt (2005)]
and can be formalized as a sequential decision strategy [Šochman & Matas (2005)] $S$, which is a
sequence of decision functions $S = S_1, S_2, \ldots, S_T$, where $S_t : \mathbb{R} \to \{\sharp, -1\}$. The evaluation of
the strategy is terminated with a negative result when a decision function outputs $-1$, whereas
$\sharp$ means that the evaluation continues. The decision function $S_t$ decides based on the tentative
sum of the weak hypotheses $H_t$, $t < T$, which is compared to a threshold $\theta_t$:

$$S_t(x) = \begin{cases} \sharp, & \text{if } H_t(x) \ge \theta_t \\ -1, & \text{if } H_t(x) < \theta_t \end{cases} \qquad (7)$$
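The sequential evaluation of a soft cascade can be sketched as follows. The sketch assumes space partitioning weak hypotheses, each given as a feature function and a look-up table whose entries already include the weight of the hypothesis; the function name and the toy features are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[int, int]


def evaluate_cascade(x: Sample,
                     features: Sequence[Callable[[Sample], int]],
                     luts: Sequence[Sequence[float]],
                     thetas: Sequence[float]) -> Tuple[bool, float, int]:
    """Return (accepted, tentative response, number of evaluated weak hypotheses)."""
    response = 0.0
    for stage, (f, lut, theta) in enumerate(zip(features, luts, thetas), start=1):
        response += lut[f(x)]   # weak hypothesis as look-up table over the feature
        if response < theta:    # decision function outputs -1: stop early
            return False, response, stage
    return True, response, len(thetas)


# Two toy features that partition the sample space into two cells each.
features = [lambda s: int(s[0] > 5), lambda s: int(s[1] > 5)]
luts = [[-1.0, 1.0], [-0.5, 0.5]]
thetas = [-0.5, -1.2]

rejected = evaluate_cascade((3, 7), features, luts, thetas)
accepted = evaluate_cascade((8, 2), features, luts, thetas)
```

The third element of the result is the quantity that later determines the computational cost: the first sample is rejected after a single weak hypothesis, the second needs both.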


WaldBoost [Šochman & Matas (2005)] is a method which produces an optimal decision
strategy for a target false negative rate. The algorithm combines real AdaBoost [Schapire &
Singer (1999)] and Wald's sequential probability ratio test [Wald (1947)].


Given a weak learner algorithm, training data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$,
and a target false negative rate $\alpha$, the WaldBoost algorithm finds a decision strategy $S^*$ with a
miss rate $\alpha_S$ which is lower than $\alpha$ and the average evaluation time $\bar{T}_S = E(\arg\min_t (S_t \ne \sharp))$
is minimal:

$$S^* = \arg\min_S \bar{T}_S, \quad \text{s.t. } \alpha_S < \alpha.$$


WaldBoost uses real AdaBoost to iteratively select the most informative weak hypotheses $h_t$.
The threshold $\theta_t$ is then selected in each iteration so that as many negative training samples
are rejected as possible while asserting that the likelihood ratio that is estimated on training
data

$$R_t = \frac{p(H_t(x) < \theta_t \mid y = -1)}{p(H_t(x) < \theta_t \mid y = +1)}$$

satisfies $R_t \ge 1/\alpha$.
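The threshold selection can be sketched with empirical fractions standing in for the two probabilities. This is a simplified illustration under that assumption, not the estimation procedure of the cited WaldBoost work; the function name and the sample values are hypothetical. Because the number of rejected negatives grows with the threshold, the sketch returns the largest candidate that satisfies the ratio condition.

```python
import math
from typing import List, Optional


def select_threshold(pos: List[float], neg: List[float], alpha: float) -> Optional[float]:
    """Largest threshold whose empirical likelihood ratio is at least 1 / alpha.

    pos and neg hold the tentative responses H_t(x) of positive and negative
    training samples. Returns None when no candidate rejects any negative
    sample while satisfying the condition.
    """
    best = None
    for theta in sorted(set(pos) | set(neg)):
        p_neg = sum(h < theta for h in neg) / len(neg)
        p_pos = sum(h < theta for h in pos) / len(pos)
        if p_neg == 0:
            continue  # nothing would be rejected at this threshold
        ratio = math.inf if p_pos == 0 else p_neg / p_pos
        if ratio >= 1 / alpha:
            best = theta  # candidates ascend, so the last hit is the largest
    return best


pos = [0.2, 0.9, 1.1, 1.5]
neg = [-1.0, -0.5, 0.1, 0.3, 1.0]

strict = select_threshold(pos, neg, alpha=0.25)
loose = select_threshold(pos, neg, alpha=0.4)
```

With the stricter miss rate the threshold stays below all positive responses, while the looser setting allows a higher threshold that rejects more negatives at the price of one positive sample.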


5. Acceleration of WaldBoost based object detection 


Acceleration of object detection can be in general based on several principles, the key ones 
being: 


* implementation on a (more) powerful computational platform, i.e. simple general
improvement of computational platforms

* exploitation of a structurally different platform compared to the traditional processor
platform


* improvement of the AdaBoost/WaldBoost machine learning and/or feature extraction
algorithms

* exploitation of redundancy and coherence in the results of classification in different (adjacent
or close) areas of the image.


The general improvement of computational platforms is not of interest in this
publication. On the other hand, structurally novel computational platforms are interesting
both in general, due to their rapid growth in computer technology, and specifically for object
detection, where the pattern of use of the computational resources suggests that
the traditional platforms are not ideal and that the massively parallel platforms are also not
completely suitable.


General improvements of the AdaBoost/WaldBoost machine learning methods are
outside the scope of this publication. However, algorithmic improvements that are not connected
with the classification itself, but rather with the redundancy due to correlation of the
classification results in different sub-images of the same image, are quite important to
investigate. Their exploitation can significantly reduce the computational effort needed for
object detection.


5.1 Classification cost and its minimization 


The relative cost of classifier evaluation can be measured and used to reduce
the computational effort by combining two or more different approaches to classifier
implementation; for example, a hardware pre-processing unit connected to a post-processing
unit on a traditional CPU. The minimization method can be applied to various types of relative
cost (computations, memory, hardware price, etc.), as its formulation is general. In this chapter,
the interest is in minimizing the use of computational resources, and the relative cost
thus roughly corresponds to computational time (except when otherwise noted).


5.1.1 Classifier statistics 


The main property of a classifier is the probability of the evaluation of a weak hypothesis,
which reflects how often the weak hypothesis is executed during detection. This value $p_i$ can
be calculated for every stage $i$ from statistics obtained on a dataset of images. Due to the
rejection nature of WaldBoost classifiers, the sequence of $p_i$ decreases and the first stage is
always evaluated (i.e. $p_1 = 1$). An example of such statistics is shown on the left in Fig. 3.
The sequence $p_i$ captures the computational complexity of the classifier.
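A minimal sketch of how such statistics could be gathered is given below. It assumes that, for every classified window, the number of weak hypotheses evaluated before termination has been recorded; the function name and the sample counts are hypothetical.

```python
from typing import List


def stage_probabilities(evaluated: List[int], length: int) -> List[float]:
    """Fraction of windows for which each stage of the classifier was executed.

    evaluated[n] is the number of weak hypotheses evaluated on window n;
    entry i of the result (0-based) is the execution probability of stage i + 1.
    """
    total = len(evaluated)
    return [sum(1 for n in evaluated if n > i) / total for i in range(length)]


# Four windows: two rejected after one stage, one after two, one fully evaluated.
evaluated = [1, 1, 2, 4]
p = stage_probabilities(evaluated, length=4)
```

The resulting sequence starts at 1 and never increases, as expected for a classifier that can only reject during evaluation.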


[Figure 3: two plots. Left: stage execution probability against stage. Right: classification cost
[weak hypotheses] against stage. Both stage axes run from 1 to 1000 on a logarithmic scale.]


Fig. 3. Example of classifier statistics. Left, stage execution probability. Right, number of
evaluated weak hypotheses on average for particular length of the classifier.


5.1.2 Cost evaluation 


In the case of AdaBoost/WaldBoost classifiers, the total cost $C$ is proportional to the number
of evaluated weak hypotheses and can be calculated by (8). $T$ is the length of the classifier.
$k$ is the overall classifier cost, which represents the evaluation cost on the particular platform
on which the classification is implemented. $p_i$ is the probability of the execution of a
particular weak hypothesis (see Section 5.1.1). $c_i$ is the relative cost of the weak hypothesis
evaluation, which addresses the possibility that the hypotheses have different costs (due to
the use of different features, for example).

$$C = k \sum_{i=1}^{T} p_i c_i \qquad (8)$$

When analyzing real classifiers, $p_i$ can be obtained from statistics on input images, $c_i$ by
time measurement or other cost estimation, and $k$ can be set to a constant value. In Fig. 3, the left
plot shows the value of $p_i$ and the right plot the area under the $p_i$ curve, which is proportional
to the amount of computational resources needed for the evaluation of the classifier.


In object detection, the most common classifiers are homogeneous (i.e. those with all weak
classifiers and features of the same type). In such cases, the cost of hypothesis evaluation is
constant, $c_i = c$. Additionally, in AdaBoost all weak hypotheses are executed every time, so
the probability of executing each hypothesis is $p_i = 1$. $C$ from (8) can thus be
simplified to $C^{(AB)}$ (for AdaBoost) and $C^{(WB)}$ (for WaldBoost) in (9).

$$C^{(AB)} = kcT, \qquad C^{(WB)} = kc \sum_{i=1}^{T} p_i \qquad (9)$$

5.1.3 Cost minimization 


The cost of classification is a property not only of the classifier but also
of the implementation of the run-time in which the classifier is executed, i.e. of feature
extraction and classifier evaluation. Different implementations with different properties exist.
Imagine, for example, an implementation A which can very efficiently evaluate K > 1 weak
hypotheses in a row, but which always evaluates all of them no matter how many weak hypotheses
are actually needed for the evaluation. It could be a pre-processing unit implemented in hardware
which rejects areas without an occurrence of the target object. Then there is an implementation B
in software which evaluates the classifier in the standard way. The computational cost per
feature is much lower in A than in B, but implementing the whole classifier in hardware is
hard to achieve due to limited resources.

$$C = \min_{0 < u < T} \left( k_A \sum_{i=0}^{u-1} p_i^{(A)} c_i^{(A)} + k_B \sum_{i=u}^{T-1} p_i^{(B)} c_i^{(B)} \right) \qquad (10)$$


Both implementations can be put together, but the problem is how many weak hypotheses 
have to be put in a hardware unit and how many are left in the software. The precise position 
of division of the evaluation is subject to minimization of classification cost (10) in order to 
find a composition with minimal cost. 


The two-phase classifier can be fine-tuned by one parameter. Equation 10 states the
minimization problem and Fig. 4 shows the values of the cost for different settings of $u$. $C$ is
the total minimal cost of the evaluation; $u$ is the point of classifier division; and $k$, $c$ and $p$
correspond to the parameters of the cost computation from Equation 8. It should be noted that,
although the classifier itself is the same for both parts, the probabilities $p$ can in general differ
between the parts. This is due to the structure of the evaluation in a particular implementation,
which can force different probabilities of feature evaluation (e.g. by evaluating more features in
one step; see Section 5.1.1).
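The search for the division point can be sketched as a plain exhaustive loop. The sketch assumes a first phase that always evaluates all of its weak hypotheses at a fixed per-hypothesis cost and a second phase that evaluates the remaining hypotheses one by one according to the stage probabilities; the cost values and probabilities are made up for the example.

```python
from typing import Sequence, Tuple


def best_division(p: Sequence[float], cost_a: float, cost_b: float) -> Tuple[int, float]:
    """Try every division point 0 < u < T and return the cheapest (u, cost).

    Phase A evaluates hypotheses 0 .. u-1 unconditionally at cost_a each.
    Phase B evaluates hypotheses u .. T-1 with probability p[i] at cost_b each.
    """
    best_u, best_cost = 0, float("inf")
    for u in range(1, len(p)):
        cost = cost_a * u + cost_b * sum(p[u:])
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u, best_cost


# Hypothetical stage probabilities that halve with every stage.
p = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
u, cost = best_division(p, cost_a=0.1, cost_b=1.0)
```

In this setting it pays to move a hypothesis into the first phase as long as its execution probability exceeds the cost ratio of 0.1, which is why the division ends up after the fourth hypothesis.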


[Figure 4: plot of total cost against division point u (logarithmic axis, 1 to 1000).]


Fig. 4. Example of minimization of classification cost for two-phase classifier. The first phase
always evaluates all weak hypotheses but the cost for a weak hypothesis is 0.1 of the second
phase. The second phase evaluates weak hypotheses one by one. The black dot marks the
division between the parts that leads to the minimum cost.


Going beyond the example given above, more than two phases of evaluation can be
used, and the minimization problem is then multi-dimensional. In the general case, described by
(11), the classifier division is a vector $\mathbf{u}$ whose values are searched for in order to find the best
composition of parts with different properties. Note that $u_i$ can be equal to $u_{i+1}$, so some
parts can in fact be skipped when the optimization finds them useless.

$$C = \min_{\mathbf{u}} \sum_{m=1}^{M} \left( k_m \sum_{i=u_{m-1}}^{u_m - 1} p_i^{(m)} c_i^{(m)} \right) \qquad (11)$$
$$\text{s.t.} \quad u_0 = 0, \quad u_M = T, \quad u_i \le u_{i+1}, \; 0 \le i < M$$

In practical applications, it is easy to get classifier statistics — it reflects classifier behavior on 
images. On the other hand, it is tricky to identify values of c and k. It has to be done by 
careful examination of performance of the particular implementation of the detection (e.g. by 
the precise measurement of time needed for the execution of weak hypotheses). 


5.2 Exploiting neighbors 


In scanning window object detection using a soft cascade detector, each image position 
is processed independently. However, much information is shared between neighboring 
positions and utilizing this information has a potential for increasing the speed of detection. 


One way to utilize the shared information is to learn suppression classifiers [Zemčík et al. 
(2010)] to predict the responses of the original detection classifier at neighboring positions. 
Computation of the original detector can then be suppressed at positions for which this 
prediction is negative and with enough confidence. 


In the case of space partitioning weak hypotheses (see Section 4), the suppression classifiers can
be made computationally very efficient by re-using the features $f_t$ computed by the original
classifier. In that case, adding the suppression classifiers just increases the size of the look-up
table $l : \mathbb{N} \to \mathbb{R}$.


The task of learning the suppression classifiers can be formulated as detector
emulation [Šochman & Matas (2007); Šochman & Matas (2009)], which allows usage of
unlabeled data for training and does not require any modifications in learning the original
detection classifier. Moreover, previously created detectors can be used as well.


[Figure 5: diagram of a scanned image with regions labelled "Already resolved", "Current",
"Predictions" and "Positions already excluded earlier".]


Fig. 5. Neighborhood suppression - during scanning, positions surrounding the currently
evaluated position can be suppressed. On such positions the classifier will not be computed.


In the classifier emulation approach [Šochman & Matas (2007); Šochman & Matas (2009)],
an existing detector is considered a black box and its decisions are used as labels for a new
WaldBoost learning problem. The algorithm for learning the suppression classifiers differs
from this basic scenario in three distinct aspects discussed below. The whole algorithm for
learning a suppression classifier is summarized in Algorithm 1.


The first change, as mentioned earlier, is that the weak hypotheses $h_t$ of a suppression
classifier reuse the features $f_t$ of the original detector, and only new look-up table functions
$l'_t$ are learned. Restricting the features makes the learning process very fast, as the selection of
an optimal weak hypothesis is generally the most time-consuming step.


The second difference is that the labels for training the suppression classifier are obtained
from a different image position than the one from which the classifier gets its information (the
position containing the original features $f_t$). This is consistent with the fact that we want to
predict responses in the neighborhood of the currently evaluated position.


Finally, the set of training samples is pruned twice in each iteration of the learning algorithm
instead of only once as in WaldBoost. The samples rejected by the new suppression classifier
are removed from the training set, as are the samples rejected by the original classifier.
This reflects the behavior during scanning, when only those features which are needed by the
detector to make a decision are computed and, consequently, the suppression classifiers can
only use these computed features to make their own decision.
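How a suppression classifier might be used at detection time can be sketched for a one-dimensional scan. The sketch assumes a single suppression classifier that predicts the result at the position immediately to the right and shares the feature responses of the detector; all names, look-up tables and thresholds are hypothetical toy values rather than trained ones.

```python
from typing import Callable, List, Sequence, Tuple


def scan_with_suppression(windows: Sequence[float],
                          features: Sequence[Callable[[float], int]],
                          luts: Sequence[Sequence[float]],
                          thetas: Sequence[float],
                          sup_luts: Sequence[Sequence[float]],
                          sup_thetas: Sequence[float]) -> Tuple[List[int], int]:
    """Return (indices of detections, number of positions actually classified)."""
    suppressed = [False] * len(windows)
    detections: List[int] = []
    classified = 0
    for i, x in enumerate(windows):
        if suppressed[i]:
            continue                      # predicted negative by the left neighbour
        classified += 1
        h = h_sup = 0.0
        accepted = True
        for f, lut, theta, s_lut, s_theta in zip(features, luts, thetas,
                                                 sup_luts, sup_thetas):
            j = f(x)                      # feature computed once, used twice
            h += lut[j]
            h_sup += s_lut[j]
            if h_sup < s_theta and i + 1 < len(windows):
                suppressed[i + 1] = True  # confident negative prediction
            if h < theta:                 # detector terminates, no more features
                accepted = False
                break
        if accepted:
            detections.append(i)
    return detections, classified


def three_bins(x: float) -> int:
    """Toy feature: partitions the input into three cells."""
    return 0 if x < 3 else (1 if x < 7 else 2)


detections, classified = scan_with_suppression(
    windows=[1, 2, 8, 9, 1],
    features=[three_bins], luts=[[-1.0, 0.2, 1.0]], thetas=[0.0],
    sup_luts=[[-1.0, 0.0, 0.0]], sup_thetas=[-0.5])
```

The suppression classifier only ever sees features the detector has already computed, which corresponds to the double pruning of the training set described above. In the example one of the five positions is skipped without changing the detections.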


5.3 Early non-maxima suppression 


Detection of objects by a scanning window technique usually employs some kind of
non-maxima suppression to select the position with the highest classifier response from a small
neighborhood in position, scale and other possible degrees of freedom. The suppressed
detections have no influence on the resulting detection, and it may not be necessary to compute
the detectors completely at these positions. There are also other applications in which only the
highest response among a number of samples is of interest. Examples of such applications are
speaker and person recognition, where a short utterance or face image is matched by a classifier
to templates from a database.


Algorithm 1 WaldBoost for learning suppression classifiers


Input: original soft cascade $H_T(x) = \sum_{t=1}^{T} h_t(x)$, its early termination thresholds $\theta^{(t)}$ and
its features $f_t$; desired miss rate $\alpha$; training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$,
where the labels $y_i$ are obtained by evaluating the original detector $H_T$ at an image position
with a particular displacement with respect to the position of the corresponding $x_i$

Output: look-up table functions $l'_t$ and early termination thresholds $\theta'^{(t)}$ of the new
suppression classifier

Initialize sample weight distribution $D_1(i) = \frac{1}{m}$
for $t = 1, \ldots, T$

1. estimate new $l'_t$ such that its [...]
2. add $l'_t$ to the suppression classifier: $H'_t(x) = \sum_{r=1}^{t} l'_r(f_r(x))$
3. find optimal threshold $\theta'^{(t)}$
4. remove from the training set samples for which $H_t(x) < \theta^{(t)}$
5. remove from the training set samples for which $H'_t(x) < \theta'^{(t)}$
6. update the sample weight distribution $D_{t+1}(i) \propto \exp(-y_i H'_t(x_i))$


The main idea of Early non-Maxima Suppression [Herout et al. (2011)] (EnMS) is to perform 
non-maxima suppression already during computation of classifiers and to stop computing 
classifiers for objects having very low probability to reach the best score in the set of the 
competing objects. 


In the context of soft cascades, EnMS can be formalized as the Conditioned Sequential Probability
Ratio Test (CSPRT), which allows the decision functions $S_t$ (see Equation 7 for the original
formulation) to be conditioned by some additional data $z_t \in \mathcal{Z}$:

$$S_t(x, z_t) = \begin{cases} -1, & \text{if } H_t(x) < \theta_t(z_t) \\ \sharp, & \text{if } \theta_t(z_t) \le H_t(x) \end{cases} \qquad (12)$$

Here the threshold becomes a function of the conditioning data.


In order to create an optimal CSPRT strategy, the threshold functions $\theta_t(z_t)$ should be
optimized for the same objectives as the thresholds $\theta_t$ in WaldBoost (see Equation 13).
The parameters of $\theta_t(z_t)$ should be set so that as many negative training samples are rejected as
possible while asserting that the likelihood ratio estimated on the training data

$$R_t = \frac{p(H_t(x) < \theta_t(z_t) \mid y = -1)}{p(H_t(x) < \theta_t(z_t) \mid y = +1)} \qquad (13)$$

satisfies $R_t \ge 1/\alpha$.


For the EnMS approach to be effective, the conditioning information $z_t$ has to encode how well
the other competing samples are classified, and the functional form of the threshold function
$\theta_t(z_t)$ has to be simple enough to allow reliable estimation of its optimal parameters.


In our approach, the weak hypothesis $h_t$ is evaluated for the whole set of competing samples
$X$ at a time, and the conditioning information is the maximum tentative classifier response on
the competing samples:

$$z_t = \max_{x \in X} H_t(x). \qquad (14)$$

We choose $\theta_t(z_t)$ as

$$\theta_t(z_t) = z_t - \lambda_t. \qquad (15)$$

With this choice of $\theta_t(z_t)$, the EnMS condition for rejecting samples in Equation 12 becomes

$$H_t(x) < z_t - \lambda_t. \qquad (16)$$


With these choices, EnMS introduces only a very small computational overhead. When
computed sequentially, a weak hypothesis $h_t$ can be computed on all active positions; then
the maximal response can be gathered and the samples fulfilling $H_t(x) < z_t - \lambda_t$ can
be suppressed. When computing positions in parallel, the process has to be synchronized
before the suppression step, and gathering the maximal value may require synchronization,
atomic instructions or a special value reduction method. However, even in highly parallel
environments, these issues are not that significant, as the potentially serial operations are
simple. Furthermore, suppression does not have to be performed after each weak hypothesis
and the computation does not have to be strictly enforced; relaxing either causes no significant
performance drawback.
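The sequential variant described above can be sketched in a few lines. The sketch assumes look-up-table weak hypotheses and non-negative margins, so that the currently best sample is never suppressed; the function name, features and margin values are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[int, int]


def early_nms(samples: Sequence[Sample],
              features: Sequence[Callable[[Sample], int]],
              luts: Sequence[Sequence[float]],
              margins: Sequence[float]) -> Tuple[int, int]:
    """Return (index of the best sample, number of weak hypothesis evaluations)."""
    responses = [0.0] * len(samples)
    active = list(range(len(samples)))
    evaluations = 0
    for f, lut, margin in zip(features, luts, margins):
        # Evaluate the current weak hypothesis on all competing samples.
        for i in active:
            responses[i] += lut[f(samples[i])]
            evaluations += 1
        # Conditioning information: maximum tentative response.
        z = max(responses[i] for i in active)
        # Suppress the samples that fall more than the margin below the maximum.
        active = [i for i in active if responses[i] >= z - margin]
    best = max(active, key=lambda i: responses[i])
    return best, evaluations


samples = [(0, 0), (1, 0), (1, 1)]
features = [lambda s: s[0], lambda s: s[1]]
luts = [[-1.0, 1.0], [-0.5, 0.5]]

best, evaluations = early_nms(samples, features, luts, margins=[1.5, 1.5])
```

The first sample drops out after the first weak hypothesis, so five evaluations are performed instead of six, while the sample with the highest final response is still found.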


6. Runtime design


6.1 Exploiting SIMD architectures


SIMD (Single Instruction Multiple Data) architectures exploit data-level parallelism to
accelerate certain operations. In contrast to instruction-level parallelism, data parallelism
means that the CPU performs the same instruction on vectors of data. This approach
is very efficient in tasks where a simple computation is performed on a large amount of data
(e.g. stream processing).


Typically, CPUs contain a standard instruction set which processes integers and floats. This
set is extended with a set of vector instructions which work over vectors of data stored in
memory. Vector instructions typically include standard arithmetic and logic instructions,
instructions for data access and other data manipulation instructions (packing, unpacking,
etc.). This is the case for general purpose CPUs such as those by Intel and AMD or the PowerPC.
Besides the general purpose CPUs, there are GPGPUs (General Purpose Graphics Processing
Units), successors of traditional GPUs (designed to process graphics primitives), which can
execute parallel kernels over data and can be viewed as advanced SIMD processors.


The SIMD architecture can be used especially to accelerate the following parts of detection. 


* Weak classifier evaluation - the instructions can be used to evaluate multiple weak classifiers.


* Feature evaluation - features such as LRD, LRP and LBP can be evaluated in a data-parallel
fashion.


When evaluating the weak hypotheses one by one, the evaluation of a feature
can be transformed to SIMD processing so that all feature samples are loaded into registers and
the response is evaluated using SIMD instructions instead of the typical implementation with a
loop [Herout et al. (2009); Juránek et al. (2010)]. This requires a pre-processing stage
that transforms the image into a SIMD-friendly form allowing simple access to the
data belonging to a feature, i.e. a convolution of the image. The speed-up of this method compared
to a naive implementation is high, a factor of around 3 to 5, depending on the particular
architecture on which it is implemented.


When evaluating multiple hypotheses at once, the implementation is much the same as for
one weak hypothesis without SIMD instructions. The difference is that the SIMD registers
can hold information for more weak hypotheses (16 in the case of Intel SSE). This leads to an
efficient implementation of AdaBoost classifiers. WaldBoost classifiers, on the other hand, can
be inefficient with this implementation, as many weak hypotheses are calculated even when
they are not needed for the classifier evaluation. Pre-processing is again needed
to simplify the data access and feature evaluation. The speed-up achieved by this method is
high. In fact, when implementing WaldBoost evaluation, it is comparable to that of the method in
the previous paragraph, even though many weak hypotheses are calculated unnecessarily.


In some cases, the feature response can be pre-calculated for all positions in the image, so that
during detection the feature is extracted by a single access to a pre-calculated image. In
this case, an image with the pre-calculated result must be created for each variant of a feature.
This is only feasible when a small number of feature variants exists. For example, LBP with
block size restricted to at most 2 x 2 pixels per block has four variants. On the other hand, LRD
with the same restriction has 144 variants (as it is additionally parametrized by the indices A
and B), and calculating such a large number of images would be computationally expensive.
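One way to arrive at these two counts is sketched below. It assumes that a block may be 1 or 2 pixels wide and 1 or 2 pixels high, and that an LRD variant is identified by the block shape together with an unordered pair of distinct grid indices (swapping the two indices only changes the sign of the response). This counting scheme is an assumption that happens to reproduce the stated numbers; it is not taken from the chapter.

```python
from itertools import combinations, product

# Block shapes (width, height) with each side limited to 1 or 2 pixels.
block_shapes = list(product((1, 2), repeat=2))

# LBP has no further parameters, so one pre-calculated image per block shape.
lbp_variants = len(block_shapes)

# LRD additionally selects two of the nine grid cells.
index_pairs = list(combinations(range(9), 2))
lrd_variants = len(block_shapes) * len(index_pairs)
```

Under these assumptions, pre-calculating LRD responses would require 36 times as many images as pre-calculating LBP responses.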


To summarize, the benefits of SIMD processing are the following: SIMD allows
features to be extracted very efficiently, and the performance of classifier evaluation can even
be increased several times. On the other hand, SIMD comes with the need
for pre-processing which, when implemented without care, can reduce performance.


6.2 GPU implementation of the detection 


Object detection on the GPU was historically implemented using programmable 
shaders [Polok et al. (2008)]; however, the contemporary state of the art lies in GP-GPU 
programming languages, such as CUDA or OpenCL [Herout et al. (2011)]. GP-GPUs 
programmed using one of these languages are among the most powerful and efficient 
computational devices. When used for object detection, GP-GPUs can be seen as SIMD 
devices with a high level of parallelism. 


Unfortunately, the high level of parallelism is difficult to exploit in WaldBoost detection, as 
the amount of computation in adjacent positions in the image is not correlated and is in 
general quite unpredictable, which heavily complicates usage of the ALUs in the SIMD device. 


The efficient implementation of object detection using CUDA [Herout et al. (2011)] solves 
the problems of two main domains: the classifier operating on one fixed-size window, and 
parallel execution of this classifier on different locations of the input image. The problem of 
object detection by statistical classifiers can be divided into the following steps: 


* loading and representing the classifier data 
* image pre-processing 
* classifier evaluation 
* retrieving results. 


The constant data containing the classifier (image features' parameters, prediction values 
of the weak hypotheses summed by the algorithm and WaldBoost thresholds) could be 
accommodated in the texture memory or constant memory of the CUDA architecture. These 
data are accessed in the evaluation of each feature at each position, so the demands for access 
speed are critical. Programs that are run on the graphics hardware using CUDA are executed 
as kernels; each kernel has a number of blocks and each block is further organized into threads. 
The code of the threads consumes hardware resources: registers and shared memory; this 
limits the number of threads that can be efficiently executed in a block (both the maximal and 
minimal number of threads). 


One thread computes one or more locations of the scanning window in the image. The image 
pixels (or more precisely, window locations) are therefore divided into rectangular tiles, which 
are processed by different thread blocks. Experiments showed that a suitable number of threads 
per block was around 128. Executing blocks for only 128 pixels of the image would not be 
efficient, so each thread computes more than one position of the window: a whole column of 
pixels in a rectangular tile. A useful consequence of this layout is easy control of the 
resources used by one block: the number of threads is determined by the width of the tile, and 
the height controls the total number of window positions processed by the block. The tile can 
extend over the whole height of the image or just a part of it. In order to avoid collisions of 
concurrently running threads and blocks when storing detections, an atomic increment (the 
atomicInc function) of one shared word in the global memory is used for synchronization. This 
operation is rather costly, but positive detections are so rare that this means of output can be 
afforded. As a consequence, the results of the whole process are available at the end in one 
place in the global memory, which can easily be made available to the host computer. 
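
The tile layout can be sketched as follows. This is a sequential Python stand-in for the 
kernel launch, written only to show the index arithmetic; the tile height of 64, the function 
names and the `is_detection` callback are assumptions made for the illustration, and the list 
`results` plays the role of the output buffer whose next free slot would be obtained on the GPU 
by an atomic increment of a shared counter. 

```python
def tile_grid(width, height, tile_w=128, tile_h=64):
    """Number of tiles (thread blocks) along x and y, using ceiling division."""
    return (width + tile_w - 1) // tile_w, (height + tile_h - 1) // tile_h


def thread_positions(bx, by, tx, width, height, tile_w=128, tile_h=64):
    """Window positions handled by thread `tx` of block (bx, by):
    one column of the tile, clipped to the image."""
    x = bx * tile_w + tx
    if x >= width:
        return []
    y0 = by * tile_h
    return [(x, y) for y in range(y0, min(y0 + tile_h, height))]


def scan(width, height, is_detection, tile_w=128, tile_h=64):
    """Visit every window position exactly once, tile by tile."""
    blocks_x, blocks_y = tile_grid(width, height, tile_w, tile_h)
    results = []                        # stands in for the global output buffer
    for by in range(blocks_y):
        for bx in range(blocks_x):
            for tx in range(tile_w):    # threads of one block
                for pos in thread_positions(bx, by, tx, width, height,
                                            tile_w, tile_h):
                    if is_detection(pos):
                        # GPU: slot index from an atomic counter increment.
                        results.append(pos)
    return results


print(tile_grid(640, 480))              # (5, 8)
```

With this layout the tile width fixes the number of threads per block, while the tile height 
only changes how many positions each thread processes. 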


The main property of the CUDA implementation is that it outperforms the CPU 
implementation mainly for high-resolution videos. This can be explained by the extra overhead 
connected with transferring the image to the GPU, starting the kernel programs, retrieving 
the results, etc. These overhead operations typically consume constant time independent of 
the problem size, so they are better amortized in high-resolution videos. 
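
The amortization argument can be made concrete with a simple linear model. The sketch below 
assumes that processing time grows linearly with the number of window positions and that the 
GPU pays a constant overhead per frame; the model, the function names and all numbers are 
illustrative assumptions, not measurements from this chapter. 

```python
def cpu_time(n_positions, cost_cpu):
    """CPU time: proportional to the number of window positions."""
    return n_positions * cost_cpu


def gpu_time(n_positions, cost_gpu, overhead):
    """GPU time: constant overhead (transfer, kernel start, result
    retrieval) plus a smaller per-position cost."""
    return overhead + n_positions * cost_gpu


def break_even(cost_cpu, cost_gpu, overhead):
    """Number of positions above which the GPU is faster, or None
    if the GPU is never faster under this model."""
    if cost_cpu <= cost_gpu:
        return None
    return overhead / (cost_cpu - cost_gpu)


# Hypothetical costs in arbitrary time units.
n_star = break_even(cost_cpu=0.5, cost_gpu=0.25, overhead=2.0)
print(n_star)                                         # 8.0
print(cpu_time(8, 0.5), gpu_time(8, 0.25, 2.0))       # 4.0 4.0
```

Below the break-even point the constant overhead dominates and the CPU wins; above it, the 
overhead is spread over more positions, which matches the behavior described for 
high-resolution videos. 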


6.3 Programmable hardware 


The runtime for object detection does not necessarily need to be implemented only in software; 
programmable hardware, namely field programmable gate arrays, is one of the options as well 
[Jin et al. (2009); Lai et al. (2007); Theocharides et al. (2006); Wei et al. (2004); Zemčík & 
Žádník (2007)]. While the algorithms of object detection are in principle the same for software 
and hardware implementations, the hardware platform offers features largely different from 
those of software. The optimal methods needed to implement detection in programmable 
hardware are therefore often different from the ones used in software and, in many cases, the 
hardware implementation may be very efficient. 


The features that are important for object detection differ greatly between hardware and 
software. Those that are beneficial for hardware implementation include: 


* massive parallelism achievable with good performance/electrical power ratio 
* variable data path width in hardware adjustable to exact algorithmic needs 
* simple implementation of bit manipulation and logical functions 
* nearly seamless complex control and data flow implementation 


Of course, the hardware implementation also has severe limitations, the most important being: 


* limited complexity of the hardware circuits 
* expensive computational resources for complex mathematical functions 
* relatively limited memory structures 
* in most cases, lower clock speed compared to processors 


Taking into account the above advantages and limitations, programmable hardware can be 
considered for object detection specifically in the following cases: 


* a low-end embedded system with limited computational power and with programmable 
hardware as a co-processor; in this setup, the programmable hardware is expected to 
perform more or less the complete detection task; 


* a high-end computational system with programmable hardware as a pre-processing unit; 
this setup differs from the one above in that the detection does not have to be done 
completely in programmable hardware. Rather, the hardware is considered a resource that 
relieves the processor of the host system of as much computation as possible, so it is 
feasible to implement a perhaps incomplete but high-performance pre-processor that 
reduces the need for computations; 


* a complete object detection system in programmable hardware that can be combined with 
image pre-processing and where the complete detection task, along with some image data 
flow considerations, should be implemented. 


Based on the above usage scenarios, the methods of implementing object detection in 
programmable hardware can be subdivided into complete detection and pre-processing. 


Complete object detection in programmable hardware can typically be implemented 
using a sequential engine, possibly microprogrammable, which performs detection location 
by location, weak classifier by weak classifier, until a decision is reached. As the evaluation 
of each weak classifier is relatively complex, the operation of the sequential unit is 
pipelined, so that several instances can be running in parallel. At the same time, different 
locations, in general, require a different number of weak classifiers to be evaluated. These 
facts lead to relatively complex timing and synchronization of processing; however, very 
good performance can be achieved [Zemčík & Žádník (2007)]. 


Fig. 6. Block structure of the object detector in programmable hardware (source: Zemčík & 
Žádník (2007)). 


In a situation where a complete evaluation of the detection is not required (e.g. when a 
powerful CPU is available) and programmable hardware can be exploited for pre-processing, 
the best approach is probably to synthesize fixed-function circuits "on demand" for each 
classifier, based on the results of the machine learning process. Such a synthesized circuit 
is most efficient when processing a (small) fixed number of weak classifiers for every 
evaluated position. While some of the weak classifiers are in such cases evaluated 
unnecessarily (assuming the WaldBoost algorithm), the average price of a weak classifier 
implementation is still often much lower than in the sequential machine described above. The 
main advantage of this approach is that all weak classifiers can be evaluated in parallel. 
However, as each weak classifier consumes chip resources, only a very small number of weak 
classifiers can be implemented in this way. 


7. Results 


7.1 Classifier cost minimization 


This section gives an example of optimizing classifier performance by balancing the 
amount of computation between a fast hardware pre-processing unit and a software 
post-processing unit. The classifiers used in this experiment were face detectors composed 
of 1000 weak hypotheses with LBP features and different false negative error rates (in a 
range from 0.02 to 0.2). 


As a baseline, a software implementation working on an integral image was selected, as it is 
the standard way of implementing the detection. The other implementations used in the 
experiments were an SSE implementation that evaluates features one by one (SSE-A) and an SSE 
implementation that evaluates 16 weak hypotheses in a row (SSE-B). 


The cost of the hardware unit was selected according to the area on the chip taken by the 
design. We set the cost to a constant c_i = 1/m, where m is the maximal number of hypotheses 
that can fit in the circuit. In this experiment, we use m = 50. In general, by setting the cost 
to a low value, we simply say that the cost of the hardware unit is not of much interest to 
us; conversely, by setting the cost to a large value, we say that the cost of the hardware is 
very important. The cost of the post-processing unit was calculated from measurements of the 
processing time of the implementations on a standard PC, and it corresponds to microseconds 
per weak hypothesis. The cost values are summarized in Table 1. With the selected costs, 
the optimization minimizes the circuit area and, at the same time, the amount of computation 
in the software. Because such diverse cost measures are combined, the result given by the 
optimization can be viewed as a "relative cost", and the interpretation of its value might be 
somewhat problematic. This does not, however, matter too much, as we do not care about 
the absolute value of the cost, but about the position of the minima. 


Implementation     Cost per weak hyp. 
INTEGRAL (ref.)    0.215 
SSE-A              0.110 
SSE-B              0.070 
FPGA               0.002 


Table 1. Costs of weak hypotheses evaluation in different implementations of detection 
runtime used in the experiment. 


Fig. 7. Optimization results for classifiers with different false negative rates. Each plot shows 
the total cost of composition of FPGA with a software implementation. The division point is 
on the horizontal axis and the cost on the vertical axis. 


Figure 7 shows four plots of total cost for different classifiers. Each plot shows the value of 
total cost for different settings of the classifier division point, and each curve corresponds 
to a particular combination of FPGA and software implementation. The results of optimization 
for a classifier with α = 0.02 are summarized in Table 2. The Division column shows the division 
of the classifier between hardware and software units; the Best cost column reflects the 
relative cost of the best solution, and the Computations column shows the fraction of 
computations performed in hardware and software units. 


Implementation    Division    Best cost    Computations 
Integral          0/1000      1.56         0/1 
SSE-A             0/1000      0.80         0/1 
SSE-B             0/1000      1.24         0/1 
FPGA+Integral     16/977      0.51         0.87/0.13 
FPGA+SSE-A        11/988      0.38         0.78/0.22 
FPGA+SSE-B        14/984      0.41         0.85/0.15 


Table 2. Summary of results for classifier with LBP features and α = 0.02. 


This example shows that it can be beneficial to use a combination of several implementations 
of detection instead of a single one. It turns out that using a hardware pre-processing unit 
improves the detection performance (in terms of computational effort). Additionally, improving 
the performance of the software part allows shorter classifiers to be used in hardware. This is 
an important fact, as FPGAs (especially the cheaper ones) typically have limited resources 
and it could be impossible to fit longer classifiers in them. Even higher performance could be 
achieved by using a neighborhood suppression method, which would affect the stage execution 
probability p in the optimization. This would result in shorter pre-processing units and lower 
total cost. 
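
The search for the best division point can be sketched in a few lines of Python. The cost 
model used here is an assumption made for the illustration: the first k weak hypotheses are 
evaluated in hardware for every position at a cost c_hw each, and every remaining stage i is 
evaluated in software at a cost c_sw with execution probability p[i]. The function names and 
the toy probabilities are hypothetical and do not reproduce the values in Table 2. 

```python
def total_cost(k, p, c_hw, c_sw):
    """Relative cost per position when the classifier is divided at stage k:
    k stages always run in hardware, stage i >= k runs in software
    with probability p[i]."""
    return k * c_hw + c_sw * sum(p[k:])


def best_division(p, c_hw, c_sw, m):
    """Division point in 0..m (m = hardware capacity) with minimal cost."""
    candidates = range(min(m, len(p)) + 1)
    k_best = min(candidates, key=lambda k: total_cost(k, p, c_hw, c_sw))
    return k_best, total_cost(k_best, p, c_hw, c_sw)


# Toy stage execution probabilities: each stage rejects half of the windows.
p = [1.0, 0.5, 0.25, 0.125]
k, cost = best_division(p, c_hw=0.3, c_sw=1.0, m=4)
print(k, round(cost, 3))      # 2 0.975
```

In this model, lowering c_sw (a faster software part) or lowering the probabilities p[i] (for 
example through neighborhood suppression) moves the minimum toward smaller k, which is 
consistent with the shorter hardware classifiers discussed above. 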


Such classifier optimization can be applied, for example, in the field of smart camera 
design. The pre-processing module can be placed directly in the camera, which then 
outputs, besides the normal image, an image marking the potential occurrences of target 
objects. Such information, as the above example has shown, substantially decreases the 
required computation time in the post-processing module. 


7.2 Neighborhood suppression results 


The suppression of neighboring positions was tested on the standard frontal face MIT+CMU 
dataset. Three WaldBoost classifiers with target false positive rates of 0.01, 0.05 and 0.2 were 
trained for four types of image features: LRD, LRP, LBP and Haar. For each classifier, three 
neighborhood suppression strategies were trained with target false positive rates of 0.01, 0.05 
and 0.2. Comparing the results of the combinations allows us to evaluate whether it is more 
effective to use neighborhood suppression than simply to use a WaldBoost classifier with a 
higher false positive rate. The results of this experiment in Fig. 8 clearly show that 
neighborhood suppression is indeed effective: on average, it evaluates fewer weak hypotheses 
per image position for the same accuracy. 
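
The scanning loop with neighborhood suppression can be outlined as below. This is only a 
structural sketch for a one-dimensional scan: `classify` and `predict_neighbor` are 
hypothetical callbacks standing in for the trained WaldBoost classifier and the trained 
suppression predictor, and the toy lambdas at the end carry no meaning beyond exercising 
the loop. 

```python
def scan_with_suppression(n_positions, classify, predict_neighbor):
    """Scan positions left to right.

    classify(x)            -> (accepted, feature_responses) for position x
    predict_neighbor(resp) -> True when position x + 1 can be rejected
                              using only the responses computed at x
    Returns (detections, number of positions actually classified).
    """
    suppressed = [False] * n_positions
    detections = []
    classified = 0
    for x in range(n_positions):
        if suppressed[x]:
            continue                      # rejected by its left neighbor
        accepted, responses = classify(x)
        classified += 1
        if accepted:
            detections.append(x)
        if x + 1 < n_positions and predict_neighbor(responses):
            suppressed[x + 1] = True
    return detections, classified


# Toy stand-ins: object at position 5; positions 0 and 2 suppress their neighbor.
result = scan_with_suppression(
    10,
    classify=lambda x: (x == 5, [x]),
    predict_neighbor=lambda responses: responses[0] in (0, 2),
)
print(result)    # ([5], 8)
```

The saving comes from the positions that are never classified; the predictor itself adds 
little work because it reuses feature responses that were computed anyway. 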


7.3 EnMS results 


EnMS was evaluated on a face localization task. The dataset was downloaded from the Flickr 
groups portraits (training) and just faces (testing). The dataset contains 84,251 training and 
6,704 testing near-frontal faces. The images were rescaled to a 100 × 100 pixel resolution with 
the face approximately 50 × 50 pixels large and positioned in the middle. Both WaldBoost and 
EnMS were evaluated on this data for several target false negative rates. The localization 
accuracy was measured as the number of images where the detector returned the position with 
the highest response of a reference classifier which always evaluated all weak hypotheses. In 
order to allow for some tolerance, errors were also counted as a failure to detect a position 
with the reference classifier response lower by 2 and by 6 than the best response, and as a 
failure to detect a position with the reference response higher than 2, which is an operating 
point that still gives reasonably low false alarms in the detection task. The results in Fig. 9 
show that EnMS provides approximately two times better speed than WaldBoost for the same 
error rates. 


Fig. 8. The graphs show AUC (Area Under ROC) on the y-axis versus the average number of 
weak classifiers evaluated per image position, as measured on the MIT+CMU frontal face 
dataset. The full line is for the original WaldBoost detectors without neighborhood 
suppression; the other lines are with added neighborhood suppression with 
different target false negative rates. Good results should be in the left (fast) bottom (accurate) 
corner. 


Fig. 9. The graphs show frontal face localization error (y-axis) for different speed-ups 
achieved by WaldBoost and EnMS. The speed-up is measured as the reduction of the number of 
weak hypotheses evaluated on average per image position relative to the full length of the 
classifier (the length is the same for WaldBoost and EnMS). The lines represent differently 
computed errors (see text) of WaldBoost and EnMS. 
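
The error and speed-up measures used in this experiment can be written down as a short 
Python sketch. It reflects one reading of the description above, namely that a returned 
position counts as correct when its reference response is no more than a given tolerance 
below the best reference response; the function names and the toy responses are 
hypothetical. 

```python
def is_localization_error(reference, returned, tolerance=0.0):
    """reference: responses of the full reference classifier per position;
    returned:  index reported by the accelerated detector, or None."""
    if returned is None:
        return True
    return reference[returned] < max(reference) - tolerance


def speed_up(full_length, evaluated_per_position):
    """Full classifier length divided by the average number of weak
    hypotheses actually evaluated per image position."""
    average = sum(evaluated_per_position) / len(evaluated_per_position)
    return full_length / average


reference = [0.0, 5.0, 9.0, 7.5]          # toy reference responses
print(is_localization_error(reference, 3))         # True  (strict measure)
print(is_localization_error(reference, 3, 2.0))    # False (within 2 of the best)
print(speed_up(1000, [10, 30, 20]))                # 50.0
```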


8. Conclusions 


This chapter focused on methods of real-time object detection with classifiers. It has been 
demonstrated that the object detection methods working in real-time are feasible and can 
be implemented on a variety of platforms, such as personal computer processors, GP-GPU 
platforms, or even in programmable hardware. 


In order to achieve real-time performance, an efficient implementation platform and an 
efficient implementation itself are necessary, but further enhancement through algorithmic 
acceleration is needed as well. Two examples of such acceleration are presented in the 
chapter: exploitation of information about the neighborhoods of already classified positions 
in the image, and early suppression of non-maxima of the classifier responses. The 
neighborhood approach is based on the idea that classification of overlapping sub-images in 
the image (the neighborhoods) may share some properties and information. One possible way 
to share such information is to re-use the weak classifiers evaluated by WaldBoost at one 
location for predicting the results at neighboring locations. This prediction is done through 
a machine learning process similar to WaldBoost, the difference being that the training 
process reuses the already selected weak classifiers that were used at the original location. 
While this process works well only in close neighborhoods, it brings a significant speed-up. 


Pre-processing that rules out some parts of the image from the detection process can 
significantly speed up detection. Important future research certainly includes 
machine-learning based pre-processing methods and research into under-sampling in scanning 
methods, which can also improve detection performance, possibly without any adverse effects 
on precision. Future research also includes algorithmic improvements of acceleration methods, 
such as improved processor assignment in GP-GPU, improved scanning trajectories 
in neighborhood exploitation, or further improvements in feature extraction. 